Detailed and precise hierarchical design planning is essential to achieving closure on large designs. In this article we describe a new hierarchical design flow and its usage on a 3 million-gate chip. The flow addresses the key shortcomings in conventional "bottom-up" and "top-down" hierarchical physical design flows.
NEC Electronics has developed an innovative, "virtually flat" hierarchical physical design flow for large cell-based IC (ASIC) designs. Monterey's IC Wizard design planner is a key tool for partitioning the design into pieces that can be handled easily by industry-standard physical synthesis and place and route tools.
Conventional flows approach hierarchical physical design from the functional/logical hierarchy view of the design: even minor hierarchy modifications are avoided because of the perceived difficulties of functional verification and timing constraint modifications. Our virtually flat flow challenges those notions and addresses the shortcomings of the conventional hierarchical flows.
In conventional flows, layout tasks are performed at the block level and top level, thus requiring block-level model generation and multiple timing and physical views, depending on the tool set used. Further, block-level timing-model generation requires the block-level netlist to conform to certain rules in order for hierarchical static timing analysis (STA) to be accurate.
The physical interactions of top-level logic and block-level logic pose additional problems (for example, crosstalk between block-level nets and top-level nets routed over the block). Avoiding the problems by disallowing over-the-block routing causes layout restrictions and routing-channel overhead.
Our flow enables us to push top-level logic, including feed-through nets, into the partitions and implement it during partition layout. Top-level assembly is done by abutment, thereby reducing top-level layout to the trivial task of connecting the I/Os and partitions.
This flow does not require block-level timing models; instead, constraints are generated for the physical partitions (Fig. 1). Partitions are forced to meet their respective constraints, and when assembled by abutment, the full-chip constraints are met automatically. Extra care is taken to specify precise constraints at the partition boundaries so that there are no surprises after assembly by abutment. Also, since routing channels are not needed, area overhead is reduced and detour routing is avoided.
To validate the hierarchical design flow, we applied it to a reference design. The flow was completed on a 3 million-gate design with 57 RAM macros, two phase-locked loops (PLLs), a main clock frequency of 100 MHz and 0.25-micron technology with five layers of aluminum interconnect. The same flow is being applied to 0.18- and 0.13-micron designs.
The most significant aspect of this flow is that planning and partitioning are completely automated. Automated top-down physical partitioning is the key enabler to realize compact layout as well as faster turnaround times in hierarchical layout flows (Fig. 2). Designers now have the ability to evaluate multiple floor plans and partitions and to optimize the floor plan to a much greater extent than with older-generation design planners.
We used a combination of commercial and internally developed tools: Monterey Design Systems' IC Wizard hierarchical design planner; industry-standard tools for STA, formal verification, physical synthesis, placement and routing; and NEC in-house tools for clock tree synthesis, RC extraction and sign-off delay calculation.
The full-chip logical hierarchy netlist is partially flattened to expose lower-level logical modules and hierarchies containing RAM macros. The process requires a review of the logical hierarchy and a decision about which instances to flatten. In the reference design, about 200 modules were promoted to the top level. That important step is required to give the floor planner sufficient granularity.
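The promotion step can be pictured as a walk down the logical hierarchy. The Python sketch below is only an illustration, not the procedure the design team used: the `Module` class, the `gate_limit` threshold and the rule of keeping RAM-containing modules whole are assumptions made for the example. It dissolves oversized modules until every exposed block is small enough, and it keeps the original hierarchical path in each block name.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Module:
    """Hypothetical model of one logical-hierarchy module."""
    name: str
    gates: int = 0          # gates instantiated directly in this module
    has_ram: bool = False   # True if the module directly instantiates a RAM macro
    children: list[Module] = field(default_factory=list)

    def total_gates(self) -> int:
        return self.gates + sum(c.total_gates() for c in self.children)


def promote(module: Module, gate_limit: int, path: str = "") -> list[tuple[str, int]]:
    """Return (hierarchical_name, gate_count) for each block exposed to the floor planner.

    Assumed rule: a module is kept whole if it contains RAM, has no children,
    or fits under gate_limit; otherwise it is dissolved and its children are
    examined in turn.
    """
    name = f"{path}/{module.name}" if path else module.name
    keep_whole = (
        module.has_ram
        or not module.children
        or module.total_gates() <= gate_limit
    )
    if keep_whole:
        return [(name, module.total_gates())]

    promoted = []
    if module.gates:
        # Logic that sat directly in the dissolved module becomes its own block.
        promoted.append((name + "/<glue>", module.gates))
    for child in module.children:
        promoted.extend(promote(child, gate_limit, name))
    return promoted


# Made-up example hierarchy, not taken from the reference design.
chip = Module("chip", children=[
    Module("cpu", children=[
        Module("alu", gates=40_000),
        Module("cache", gates=20_000, has_ram=True),
    ]),
    Module("dma", gates=30_000),
])

blocks = promote(chip, gate_limit=50_000)
for block_name, gate_count in blocks:
    print(block_name, gate_count)
```

Lowering `gate_limit` exposes more, smaller blocks, which is the granularity trade-off the review of the logical hierarchy has to settle.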
Next, the automatic floor planner is run, placing all of the logic modules as if they were physical blocks. For the initial chip-level design plan, the design planner was able to place and shape all of the blocks in less than two hours. As part of the floor-planning step, the planner also determines the boundaries for the specified number of partitions, which are roughly equal in size. That maximizes the efficiency of the block-level implementation, especially if the blocks are implemented in parallel.
There are two alternative ways to handle PLLs, depending on power requirements. Digital PLLs, in particular, can be kept within a partition since they do not have special power requirements. Analog PLLs with special power requirements need to be promoted to the chip level and placed close to the appropriate power pad to have their own custom power routing.
The reference design had hierarchical blocks composed of RAM and standard-cell logic. Because of the functional association between standard-cell logic and RAM, we did not split that hierarchy; rather, we made such hierarchical blocks part of the initial chip-level floor plan and then placed the memory blocks and standard cells inside the hierarchical block. The relative memory locations were subsequently mapped to chip-level locations and exported for partition-level placement.
The power grid structure for the design was implemented using NEC Electronics' in-house tool. Only the grid structure (exact location of the power strips) was imported into the design planner, which constrained the locations where the partition boundaries could be cut and the pins placed.
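To make the grid constraint concrete, the following Python sketch snaps a desired partition boundary to the nearest legal cut line. It is an illustration under stated assumptions: the article does not say where cuts are allowed relative to the power strips, so the rule used here (cuts are legal only midway between neighbouring strips) and the function names `legal_cuts` and `snap_cut` are hypothetical.

```python
from bisect import bisect_left


def legal_cuts(strip_centers):
    """Candidate cut lines: midpoints between neighbouring power strips.

    The midpoint rule is an assumption for this example; at least two
    strips are needed to produce any cut.
    """
    s = sorted(strip_centers)
    return [(a + b) / 2 for a, b in zip(s, s[1:])]


def snap_cut(desired, cuts):
    """Return the legal cut closest to the desired position.

    `cuts` must be sorted and non-empty. On a tie the lower cut wins.
    """
    i = bisect_left(cuts, desired)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = cuts[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda c: abs(c - desired))


# Made-up strip positions in microns.
strips = [0, 100, 200, 300]
cuts = legal_cuts(strips)
print(cuts)
print(snap_cut(170, cuts))
```

The same snapping could be applied to pin locations along a boundary, since the imported grid constrains those as well.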
For port placement, the design planner uses the image of global routing to determine the port location on the partition boundary. Nets are automatically partitioned; nets that cross the boundary between two contiguous partitions are cut at the boundary, and ports are added to both partitions. As a result, only adjoining partitions have connectivity.
Nets that connect noncontiguous partitions require feed-through nets. IC Wizard inserts a prespecified buffer as a placeholder on the feed-through net, and timing budgets are established to drive accurate repeater insertion on those nets as part of physical synthesis and optimization. Partial flattening and partitioning of the netlist change the logical hierarchy. In addition, the insertion of feed-through nets in the partition netlist makes formal verification necessary to ensure equivalence to the original netlist.
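The net cutting and feed-through insertion described above can be sketched in a few lines of Python. This is a simplified model, not IC Wizard's implementation: partitions are assumed to sit on a grid addressed by (column, row), an L-shaped path stands in for the global route the planner actually uses, and the naming scheme for segments and placeholder buffers is invented for the example.

```python
def l_route(src, dst):
    """Partition-by-partition path from src to dst, horizontal leg first.

    A stand-in for the global route; each step moves to an adjoining partition.
    """
    (c, r), (c1, r1) = src, dst
    path = [(c, r)]
    while c != c1:
        c += 1 if c1 > c else -1
        path.append((c, r))
    while r != r1:
        r += 1 if r1 > r else -1
        path.append((c, r))
    return path


def partition_net(net, src, dst):
    """Cut a two-pin net at every partition boundary it crosses.

    Returns (ports, feedthroughs):
      ports        maps each partition on the path to its new (segment, direction) ports
      feedthroughs lists (partition, placeholder_buffer_name) for pass-through partitions
    """
    path = l_route(src, dst)
    ports = {p: [] for p in path}
    for i, (a, b) in enumerate(zip(path, path[1:])):
        seg = f"{net}__seg{i}"          # hypothetical naming scheme
        ports[a].append((seg, "out"))   # port on the upstream side of the boundary
        ports[b].append((seg, "in"))    # matching port on the downstream side
    # Every partition strictly between source and sink only passes the net through.
    feedthroughs = [(p, f"{net}__ft_{p[0]}_{p[1]}") for p in path[1:-1]]
    return ports, feedthroughs


ports, feedthroughs = partition_net("data_bus7", src=(0, 0), dst=(2, 1))
for partition, plist in ports.items():
    print(partition, plist)
print(feedthroughs)
```

Because each segment connects only two adjoining partitions, the result has the property noted above: after cutting, only neighbouring partitions share connectivity.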
In our flow, formal verification is facilitated in a number of ways. The original logical hierarchy names are preserved on the flattened module instances. Also, logical subhierarchies are preserved within a partition. We keep the feed-through nets at the top level within a partition hierarchy, and we can use a consistent naming scheme for each partition.
We use an industry-standard STA tool to automatically partition the timing exceptions and create timing budgets for each physical partition. We then execute partition-level STA to resolve any issues with the partition-level constraints. After completing partition-level placement, we revisit and revise timing budgets as necessary.
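As an illustration of what a timing budget looks like, the Python sketch below splits one clock period across the partitions a path traverses. The article does not say which budgeting method the STA tool applies, so proportional allocation by estimated delay is an assumption here, as are the function names and the example delays. The 10 ns period corresponds to the 100 MHz main clock of the reference design.

```python
def budget_path(period_ns, est_delays_ns):
    """Scale each partition's estimated delay so the budgets sum to the clock period.

    Proportional scaling is one common heuristic, assumed here for illustration.
    """
    total = sum(est_delays_ns.values())
    if total <= 0:
        raise ValueError("estimated delays must sum to a positive value")
    scale = period_ns / total
    return {part: d * scale for part, d in est_delays_ns.items()}


def boundary_constraints(period_ns, budgets, order):
    """Turn budgets into per-partition boundary constraints.

    input_delay  = time already consumed upstream when the signal enters the partition
    output_delay = time still needed downstream after the signal leaves it
    """
    constraints = {}
    arrival = 0.0
    for part in order:
        constraints[part] = {
            "input_delay": arrival,
            "output_delay": period_ns - arrival - budgets[part],
        }
        arrival += budgets[part]
    return constraints


# Made-up estimates for a path that crosses three partitions in order.
order = ["P1", "P2", "P3"]
budgets = budget_path(10.0, {"P1": 2.0, "P2": 1.0, "P3": 3.0})
constraints = boundary_constraints(10.0, budgets, order)
for part in order:
    print(part, round(budgets[part], 3), constraints[part])
```

If every partition meets its budget, the assembled path meets the full period, which is the reasoning behind revisiting the budgets once placement gives better delay estimates.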
Partition-level cell placement is driven by the RAM and port locations from design planning and by the constraints from timing budgeting. A physical synthesis tool performs cell placement; afterward, the partitions are merged into a full-chip placed DEF file.
The reference design has two clock domains: one driving 100k flip-flops and another driving 2k flip-flops. Flat chip-level clock tree synthesis is feasible for this design. A hierarchical clock tree methodology is also supported, as follows: A balanced top-level clock tree is built with partition clock ports as endpoints. The information is pushed into the partition so that the partition-level layout flow is aware of the preexisting clock tree layout. The balancing of the top-level tree is optionally driven by an estimated clock insertion delay within each physical partition. During partition-level layout, the partition clock tree is implemented with the estimated insertion delay as a constraint.
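The balancing arithmetic behind the hierarchical option can be shown with a short Python sketch. It is not NEC's clock tree synthesis tool; the function name, the idea of a minimum achievable top-level delay per port, and all numbers are assumptions for the example. Delays are in picoseconds so the arithmetic stays in integers.

```python
def top_level_targets(est_insertion_ps, min_top_delay_ps):
    """Choose a common clock latency and the top-level delay needed to each partition port.

    est_insertion_ps  estimated clock insertion delay inside each partition
    min_top_delay_ps  smallest top-level delay achievable to each partition clock port
                      (a hypothetical input for this sketch)
    """
    # The slowest partition sets the chip-wide latency target.
    target = max(min_top_delay_ps[p] + est_insertion_ps[p] for p in est_insertion_ps)
    # Every other port gets extra top-level delay to line up with that target.
    per_port = {p: target - est_insertion_ps[p] for p in est_insertion_ps}
    return target, per_port


est_insertion = {"P1": 1200, "P2": 800, "P3": 1500}
min_top_delay = {"P1": 500, "P2": 900, "P3": 400}

target, per_port = top_level_targets(est_insertion, min_top_delay)
print("common latency:", target)
for port, delay in per_port.items():
    print(port, "top-level delay:", delay)
```

Each partition then builds its own tree against its estimated insertion delay; if it hits that number, the flip-flops in all partitions see the same total latency.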
Clock skew challenges
Achieving good clock skew in conventional hierarchical flows is a challenge; the number of endpoints to be balanced may vary by one or two orders of magnitude between logical hierarchy blocks. If routing over the block is not allowed, top-level clock tree balancing has limited channel resources. Both issues are addressed by this flow. Physical partitioning results in a balanced number of flops between partitions. Top-level clock implementation allows the use of over-the-partition routing.
During the development of the hierarchical design flow, we discovered that we had not yet reached the capacity limit of the gate-level router. That discovery allowed us to route the design flat at the chip level. But we do not see any difficulty in doing partition-level routing and then assembling the partitions by abutment. Further, as mentioned previously, specifying precise constraints, such as expected transition times at the output and input ports, and placing buffers next to input ports ensure that timing realized at the partition level is not altered at the top level.
After final routing, the design passed DRC verification, and NEC Electronics completed postlayout sign-off extraction, delay calculation and timing analysis.
---
Arun Balakrishnan and Wolfgang Roethig, design engineering managers, and Gopal Dandu, staff design engineer, at NEC Electronics Inc. (Santa Clara, Calif.), hold advanced degrees from Rutgers, France's ENST and India's Delhi College of Engineering, respectively. Benny Winefeld is with Monterey Design Systems (Sunnyvale, Calif.).
http://www.isdmag.com
Copyright © 2002 CMP Media LLC
6/1/02, Issue # 14156, page 32.